ABSTRACT



A virtual intercom method implemented by a computer-based
electronic assistant and including the steps of receiving
a voice command from a first party, wherein the voice
command identifies a user selectable one of a plurality of
subscriber parties; responding to the voice command by
generating a greeting to the first party, wherein the greeting
is an audio recording in the voice of the identified
subscriber; receiving a message generated by the first party for
the identified subscriber; and storing the received message
for later playback to the identified subscriber.

2 Claims, 6 Drawing Sheets 
A CONVERSATIONAL TELEPHONE MESSAGING SYSTEM

Chris Schmandt and Barry Arons 
Architecture Machine Group 
Massachusetts Institute of Technology 



The Phone Slave is a personal, integrated
telecommunications management system, combining
diverse message functions in a single user interface on a
small general purpose computer. This paper will focus
on the audio components of that interface (a related
publication emphasizes the graphical interface [4]).

The Phone Slave is an intelligent answering machine,
conversing with callers to format messages and relaying
personal greetings to identified parties. Its owner can
access these voice messages as well as electronic mail
via speech recognition or Touch Tones over the phone
network. Access to both incoming and outgoing
messages, an on line directory, and autodial features are
also provided by a touch-sensitive color monitor.

Conversational Answering Machine

When the machine answers the phone, it quickly
engages the caller in a conversation, playing back stored
speech segments while recording the caller's
responses. By asking questions such as "Who's
calling?" and "At what number can you be reached?",
the machine formats message segments for its owner.
These questions are phrased both for completeness of
the message and to enable response to queries such as
"Who left messages?" by the owner.


machine: Hello, Barry's telephone speaking, who's
calling please?

caller: Marc Spehlmann.

m: What is this in reference to?
c: Speech Compression.

m: He's not available at the moment, but he left this
message: <Barry's voice> I'm having lunch at
Primo's. I should be back by 1:00. At what number
can he reach you?

c: 258 6681

m: When will you be there?

c: I'll be here all night.

m: Can I take a longer message?

c: I have the CVSD hardware running on the IBM PC,
I'll give you a demo as soon as Dave Chen gets the
software running.

m: I'll be sure he gets that. Thanks for calling.
Goodbye.

c: Bye.



This well-timed series of questions guides the caller, who
may not be prepared to simply recite a message to a tape
recorder. The conversation can take a number of
branches to exchange the maximum amount of
information between both parties (figure 1).

An adaptive pause detection scheme is used to
determine when the caller has finished answering each
question. The goal is a snappy response without cutting
off those who speak haltingly. The end of utterance
timeout for pause detection is initially set to 1.25
seconds. If intermediate length pauses of greater than
.75 seconds are detected, the timeout
is gradually lengthened to 2 seconds.
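
The adaptive timeout described above can be sketched as follows. This is a minimal illustration, not the Phone Slave implementation: the three time constants come from the text, while the class name, the method name, and the 0.25 second adjustment step are assumptions, since the paper does not say how quickly the timeout grows.

```python
class AdaptivePauseTimeout:
    """End-of-utterance timeout that lengthens for halting speakers."""

    INITIAL = 1.25       # seconds, from the text
    MAXIMUM = 2.0        # seconds, from the text
    INTERMEDIATE = 0.75  # seconds, from the text
    STEP = 0.25          # seconds per adjustment: an assumption

    def __init__(self):
        self.timeout = self.INITIAL

    def observe_pause(self, seconds):
        """Return True if this pause ends the caller's reply."""
        if seconds >= self.timeout:
            return True
        if seconds > self.INTERMEDIATE:
            # The caller hesitated mid-reply: tolerate longer pauses from now on.
            self.timeout = min(self.MAXIMUM, self.timeout + self.STEP)
        return False


detector = AdaptivePauseTimeout()
for pause in (0.4, 0.9, 1.1, 1.6):
    detector.observe_pause(pause)
```

A caller who pauses briefly keeps the snappy 1.25 second timeout, while one who repeatedly hesitates for more than 0.75 seconds is given progressively more time, up to the 2 second ceiling.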

It is essential for a segmented playback scheme that the
reply to each question be specific and reasonably brief.
To counteract callers who ramble or try to answer more
than one question at a time, each response has an
associated maximum length. If the caller exceeds it, the
machine interrupts in a louder voice, politely asks the
caller to be precise, and repeats the question.

While recording, all reply segments are power
normalized to be approximately [...]
playback. [...]
when speech signal levels are not much higher than
background noise during pause detection. At this point
the machine asks the caller to speak up and restarts.

A number of possible voices and dialogs were 
experimented with. While not attempting to deceive 
callers into believing they are speaking with a person, it 
is important that they realize this is not a typical 
answering machine. The machine identifies itself as
"Barry's telephone speaking" in a pleasant voice which
is clearly different from the owner's. The owner's voice is
heard only to deliver the outgoing message, which is of 
course changed frequently. 

Caller Identification 

The answer to the first question, "Who's calling?", is
processed by a speech recognizer simultaneously with
recording (figure 2). If a match on the voice pattern of a
frequent caller is obtained, the conversation branches:
the caller is greeted by name and played a
personal recording for that specific caller.



Manuscript received June 11, 1984.






As a backup or possible substitute for speech
recognition, the caller may answer the "Who's calling?"
question by keying in her own phone number with Touch
Tones. A familiar caller expects to be greeted by name
after identifying herself. If, instead, the machine just asks
"What's this in reference to?", she can still key in an ID,
at which point the machine apologizes and delivers any
personal messages.

This branch of the conversation tree asks whether the
caller can be reached at her usual number, informs her if
her last message has been heard by the owner, and if not
says "If you'd like to leave a (another) message, I'll
record it now, otherwise hang up and I'll tell him you
called (again)".
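
The identification and branching steps above might be organized as in the sketch below. It is illustrative only: the `KnownCaller` record, the `identify` and `branch` functions, the directory keyed by phone number, and the wording of every prompt other than those quoted in the paper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class KnownCaller:
    name: str
    phone: str
    personal_recording: Optional[str] = None   # owner's recording for this caller
    last_message_heard: Optional[bool] = None  # None: no earlier message on file


def identify(directory: Dict[str, KnownCaller],
             voice_match: Optional[str] = None,
             keyed_number: Optional[str] = None) -> Optional[KnownCaller]:
    """Prefer a recognizer match; fall back to a keyed-in phone number."""
    if voice_match is not None:
        for caller in directory.values():
            if caller.name == voice_match:
                return caller
    if keyed_number is not None:
        return directory.get(keyed_number)
    return None


def branch(caller: Optional[KnownCaller]) -> List[str]:
    """List what the machine plays next, in order."""
    if caller is None:
        # Unidentified caller: continue with the standard question.
        return ["What is this in reference to?"]
    script = ["Hello, " + caller.name + "."]
    if caller.personal_recording:
        script.append(caller.personal_recording)
    script.append("Can you be reached at your usual number?")
    if caller.last_message_heard is not None:
        heard = "has" if caller.last_message_heard else "has not yet"
        script.append("He " + heard + " heard your last message.")
    script.append("If you'd like to leave a message, I'll record it now, "
                  "otherwise hang up and I'll tell him you called.")
    return script


directory = {"2586681": KnownCaller("Marc", "2586681", "<recording for Marc>", False)}
```

Here `voice_match` stands in for whatever name the speech recognizer reports, and `keyed_number` for digits entered with Touch Tones.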

The machine encourages participation [by offering a]
variety of options in message type and responding
personally to all callers. Most important [is the delivery]
of a specific message with greater content than the
generic outgoing "I can't answer my phone right now."
A dialog may occur through a series of calls by the owner
and a friend, although the parties never connect directly.

Even a previously unknown caller may benefit from this
treatment. After a call by an unrecognized person is
finished, the digitized voice of their answer to "Who's
calling?" is used to train a new template in the speech
recognizer. On callback, they will be informed
whether the owner has heard their message, receive any
personal reply, and be asked if they wish to leave another
message.

Message Retrieval 

Messages are recorded as a series of distinct audio
segments, to facilitate message access. The machine
may play back individual responses, or a series of
responses to indicate who left messages, or the entire
content of a single message (figure 3). Local access is
via a touch sensitive display (figure 4), with remote
access by speech recognition or DTMF tones.
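
One way to picture segmented storage is a record per call that maps each question to the recording of its answer. The sketch below assumes this layout; the segment labels, the `VoiceMessage` class, and the file names are hypothetical and are chosen only to mirror the questions in the sample dialog.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Segment labels follow the questions in the sample dialog (assumed names).
SEGMENT_ORDER = ["who", "subject", "number", "when", "message"]


@dataclass
class VoiceMessage:
    # Maps a segment label to a reference to its recording, e.g. a file name.
    segments: Dict[str, str] = field(default_factory=dict)


def who_left_messages(messages: List[VoiceMessage]) -> List[str]:
    """One segment from every message: the answers to "Who's calling?"."""
    return [m.segments["who"] for m in messages if "who" in m.segments]


def entire_message(message: VoiceMessage) -> List[str]:
    """Every recorded segment of one message, in conversation order."""
    return [message.segments[name] for name in SEGMENT_ORDER
            if name in message.segments]


inbox = [
    VoiceMessage({"who": "msg1_who.au", "subject": "msg1_subject.au",
                  "message": "msg1_body.au"}),
    VoiceMessage({"who": "msg2_who.au", "message": "msg2_body.au"}),
]
```

Because each answer is stored separately, a query such as "Who left messages?" only needs to play one short segment per call rather than whole recordings.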

The owner may access all message components
remotely over a phone connection, leave a new personal
reply for any caller, or request the time of a call or the
caller's phone number from the directory. The machine
switches between a command mode, during which the
owner's speech is being interpreted through the
recognizer, and a record mode, during which replies are
being recorded until a significant pause is detected.
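
The two modes can be modeled as a small state machine, sketched below under stated assumptions: the `RemoteSession` class, the string that stands in for a recognizer result, and the 2 second value for a "significant pause" are all hypothetical, as the paper gives no threshold for this mode.

```python
class RemoteSession:
    SIGNIFICANT_PAUSE = 2.0  # seconds; assumed, not stated in the paper

    def __init__(self):
        self.mode = "command"
        self.reply = []  # stretches of audio captured for the current reply

    def hear(self, speech, trailing_pause):
        """Process one stretch of owner speech and the silence that follows it."""
        if self.mode == "command":
            if speech == "take a reply":  # stands in for a recognizer result
                self.mode = "record"
                self.reply = []
                return "Ready to record."
            return "command: " + speech
        # Record mode: nothing is interpreted, everything is stored.
        self.reply.append(speech)
        if trailing_pause >= self.SIGNIFICANT_PAUSE:
            self.mode = "command"
            return "Stopped recording, mail was sent."
        return None
```

Note that in record mode even the words "take a reply" are stored rather than acted on, which is the point of keeping the two modes separate.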

owner: Hi, this is Barry.

m: Hi Barry, you have [three] new messages.
[The first is from] MARC SPEHLMANN

o: What's it about?
m: SPEECH COMPRESSION
o: What's the message?

m: I HAVE THE CVSD HARDWARE RUNNING ON THE IBM PC,
I'LL GIVE YOU A DEMO AS SOON AS DAVE CHEN GETS
THE SOFTWARE RUNNING



o: Who left messages?

m: Marc Spehlmann, Chris, Walter.

o: Next Message.

m: One from Chris, it's about "Japanese
Video Tape."

o: What'd he say?

m: We will re-shoot Phone Slave
videotape, in English and Japanese...

o: When was it?

m: Message received yesterday afternoon
at three.

o: Take a reply.
m: Ready to record.

o: Hi Chris, it should only take about 3 days to get the
Phone Slave software to run in Japanese. Let's
schedule taping with Shigeru tomorrow.

m: Stopped recording, mail was sent.

o: Next Message.

m: One from Walter Bender, this is a
long message, it's about "Anti-Aliased
[Line] Routines."

o: Goodbye.

m: Goodnight.

Although speech recognition over telephone lines [1, 3]
is improving, accuracy degrades significantly with the
noise levels frequently found on trunk lines. As a [limited]
backup, a significant subset of the command [set is]
provided through Touch Tones. A small but [...]
command [set was]
selected (figure 5), in the belief that this would be more
useful than the full set of commands implemented with
either multiple stroke entries or sub menus.



Unified Electronic Mail 



Electronic mail messages are integrated with voice
messages, and may be viewed on the screen or heard
over the phone with a text to speech synthesizer. On the
prototype system in use in our laboratory, it is quite
common to receive both forms from the same person,
and they are grouped together appropriately for easy
access. This allows text replies to voice messages and
vice-versa.

Several limitations of synthetic speech have been
addressed. The first is intelligibility, which may be
disappointingly low. As a listener is exposed to a
particular synthetic speech peripheral, and becomes
accustomed to it, misunderstanding errors decrease
significantly, much as one improves in ability to
understand a regional or foreign accent [5].
Pronunciation is improved through an on-line exception 
dictionary, translating names, local jargon, or other 
confusing words into an alternate spelling for correct 
pronunciation. 
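
An exception dictionary of this kind amounts to a word-level substitution applied before text reaches the synthesizer. The sketch below is a minimal illustration; the function name and the two dictionary entries are hypothetical, and a real synthesizer would need respellings tuned to its own letter-to-sound rules.

```python
import re

# Hypothetical entries: lower-case word -> respelling the synthesizer reads well.
EXCEPTIONS = {
    "spehlmann": "spellman",
    "cvsd": "C V S D",
}


def apply_exceptions(text, exceptions):
    """Replace each listed word with its alternate spelling; leave the rest alone."""
    def swap(match):
        word = match.group(0)
        return exceptions.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", swap, text)
```

Matching whole words case-insensitively keeps the dictionary small, since one entry covers "CVSD", "cvsd", and "Cvsd" alike.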

A second intelligibility aid is a repeat command, which 
replays text, starting from the previous sentence, at a 
slower rate. The second invocation of repeat spells the 
sentence in question letter by letter. 
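
The escalating behavior of repeat can be sketched as a reader that tracks its position and counts consecutive repeat requests. This is an assumption-laden illustration: the `SpokenText` class, the tuple it returns in place of driving a synthesizer, the sentence splitter, and the 0.7 slow rate are all invented for the example.

```python
import re


class SpokenText:
    NORMAL_RATE = 1.0
    SLOW_RATE = 0.7  # relative speaking rate; an assumed value

    def __init__(self, text):
        # Naive sentence split on terminal punctuation followed by whitespace.
        self.sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        self.position = 0  # index of the next sentence to speak
        self.repeats = 0   # consecutive repeat requests

    def speak_next(self):
        sentence = self.sentences[self.position]
        self.position += 1
        self.repeats = 0
        return ("speak", sentence, self.NORMAL_RATE)

    def repeat(self):
        # Go back to the sentence most recently spoken.
        sentence = self.sentences[max(self.position - 1, 0)]
        self.repeats += 1
        if self.repeats == 1:
            return ("speak", sentence, self.SLOW_RATE)
        # Second and later requests: spell the sentence out letter by letter.
        return ("spell", [ch for ch in sentence if ch.isalnum()], self.SLOW_RATE)
```

Resetting the repeat counter whenever playback advances means the listener always gets the gentler slow replay first and spelling only on a second request for the same sentence.
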

Even though word by word understanding may become
fairly high with usage, this takes some effort, such that a 
listener is less likely to comprehend the meaning of the 
sentence or paragraph being spoken [2]. To avoid clutter 
in the speech channel and minimize memory demands, 
header information, such as the date and time of 
message delivery, is withheld until requested. With
similar intent, messages are grouped according to the
sender, so all messages from a particular source are 
played sequentially. 
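
Grouping by sender and withholding headers can be sketched as below. The tuple layout, the function names, and the spoken "Messages from" phrase are assumptions for illustration; the paper does not describe its mail data structures.

```python
def group_by_sender(messages):
    """Group (sender, header, body) tuples by sender, keeping arrival order."""
    groups = {}
    for sender, header, body in messages:
        groups.setdefault(sender, []).append((header, body))
    return groups


def playback_script(messages, with_headers=False):
    """List the strings to synthesize; headers are spoken only on request."""
    script = []
    for sender, items in group_by_sender(messages).items():
        script.append("Messages from " + sender)
        for header, body in items:
            if with_headers:
                script.append(header)
            script.append(body)
    return script


mail = [
    ("Chris", "Tuesday 3 pm", "We will re-shoot the videotape."),
    ("Walter", "Tuesday 4 pm", "The routines are ready."),
    ("Chris", "Tuesday 5 pm", "Taping is tomorrow."),
]
```

With the default `with_headers=False`, the listener hears only senders and message bodies, which keeps the speech channel uncluttered until a date or time is explicitly requested.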

A voice reply to a text message may be taken, in which
case mail is sent informing the original sender that a
voice message awaits, giving the phone number and an
access code. The universal accessibility of the phone
network allows speech to be transmitted anywhere, so all
sound storage can be local, with no assumptions about
remote site capability or message protocols.

This work has been funded by grants from Atari, Inc. and
NTT, the Nippon Telegraph and Telephone Company.
Speech synthesis hardware was supplied by Speech
Plus. The authors also wish to thank Marc Spehlmann
for his dedicated work building the telephone interface
hardware.